Overall Direction

TODO

  1. we are choosing log(price) as the transform for this predictor. Justify.
  2. decide between 2 possible transforms for the response, boxCoxTransform at optimal lambda, or log(points). Justify with regsubsets, which is better re the assumptions of MLR, and smaller average RMSE over 10-fold cross validation.
  3. continentTopFive is better than just continent as a categorical predictor. Justify by comparing integrated models. Ex. now we’re at boxCoxTrans(points) ~ log(price)*continentTopFive as the model.

Choice between continent and continentTopFive as first categorical

Compare BoxCox optimal versus log(points) transform of response

## 
##  studentized Breusch-Pagan test
## 
## data:  mod
## BP = 4864.1, df = 13, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod), 5000)
## W = 0.99665, p-value = 3.615e-09
## 
##  studentized Breusch-Pagan test
## 
## data:  mod
## BP = 4864.1, df = 13, p-value < 2.2e-16
## 
## Call:
## glm(formula = boxcoxTrans(points, optimal_lamda, 0) ~ log(price) * 
##     continentTopFive, data = wine_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -133.562   -14.671     0.719    15.653    91.658  
## 
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                             463.0775     3.1920 145.075
## log(price)                               25.5649     1.0856  23.550
## continentTopFiveAsia                     10.7109     6.8207   1.570
## continentTopFiveAustralia                -5.0864     3.5616  -1.428
## continentTopFiveEuropeTop5               -2.6382     3.2394  -0.814
## continentTopFiveNorthAmerica            -19.4814     3.2543  -5.986
## continentTopFiveOtherEurope              -8.6811     3.9573  -2.194
## continentTopFiveSouthAmerica            -14.8069     3.4298  -4.317
## log(price):continentTopFiveAsia          -6.6737     2.1369  -3.123
## log(price):continentTopFiveAustralia      1.4512     1.1940   1.215
## log(price):continentTopFiveEuropeTop5     0.6644     1.0980   0.605
## log(price):continentTopFiveNorthAmerica   4.1913     1.1016   3.805
## log(price):continentTopFiveOtherEurope    3.0472     1.3077   2.330
## log(price):continentTopFiveSouthAmerica   2.8761     1.1716   2.455
##                                         Pr(>|t|)    
## (Intercept)                              < 2e-16 ***
## log(price)                               < 2e-16 ***
## continentTopFiveAsia                    0.116336    
## continentTopFiveAustralia               0.153265    
## continentTopFiveEuropeTop5              0.415405    
## continentTopFiveNorthAmerica            2.15e-09 ***
## continentTopFiveOtherEurope             0.028259 *  
## continentTopFiveSouthAmerica            1.58e-05 ***
## log(price):continentTopFiveAsia         0.001790 ** 
## log(price):continentTopFiveAustralia    0.224211    
## log(price):continentTopFiveEuropeTop5   0.545123    
## log(price):continentTopFiveNorthAmerica 0.000142 ***
## log(price):continentTopFiveOtherEurope  0.019793 *  
## log(price):continentTopFiveSouthAmerica 0.014092 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 548.6045)
## 
##     Null deviance: 91898757  on 102921  degrees of freedom
## Residual deviance: 56455794  on 102908  degrees of freedom
## AIC: 941264
## 
## Number of Fisher Scoring iterations: 2

## [1] 90.07612 85.07601 94.94174
## [1] 0.0181028
## [1] 548.6591
## [1] 10.96185

## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 3905.5, df = 13, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod1), 5000)
## W = 0.99545, p-value = 2.185e-11
## 
##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 3905.5, df = 13, p-value < 2.2e-16
## 
## Call:
## glm(formula = log(points) ~ log(price) * continentTopFive, data = wine_train)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.166397  -0.018153   0.001614   0.019576   0.102402  
## 
## Coefficients:
##                                           Estimate Std. Error  t value
## (Intercept)                              4.3764767  0.0039361 1111.887
## log(price)                               0.0315053  0.0013386   23.536
## continentTopFiveAsia                     0.0120168  0.0084106    1.429
## continentTopFiveAustralia               -0.0048292  0.0043919   -1.100
## continentTopFiveEuropeTop5              -0.0013714  0.0039945   -0.343
## continentTopFiveNorthAmerica            -0.0229106  0.0040129   -5.709
## continentTopFiveOtherEurope             -0.0095902  0.0048798   -1.965
## continentTopFiveSouthAmerica            -0.0192454  0.0042294   -4.550
## log(price):continentTopFiveAsia         -0.0079028  0.0026350   -2.999
## log(price):continentTopFiveAustralia     0.0012709  0.0014723    0.863
## log(price):continentTopFiveEuropeTop5    0.0001409  0.0013539    0.104
## log(price):continentTopFiveNorthAmerica  0.0046728  0.0013583    3.440
## log(price):continentTopFiveOtherEurope   0.0033206  0.0016125    2.059
## log(price):continentTopFiveSouthAmerica  0.0037301  0.0014447    2.582
##                                         Pr(>|t|)    
## (Intercept)                              < 2e-16 ***
## log(price)                               < 2e-16 ***
## continentTopFiveAsia                    0.153077    
## continentTopFiveAustralia               0.271515    
## continentTopFiveEuropeTop5              0.731350    
## continentTopFiveNorthAmerica            1.14e-08 ***
## continentTopFiveOtherEurope             0.049384 *  
## continentTopFiveSouthAmerica            5.36e-06 ***
## log(price):continentTopFiveAsia         0.002708 ** 
## log(price):continentTopFiveAustralia    0.388032    
## log(price):continentTopFiveEuropeTop5   0.917124    
## log(price):continentTopFiveNorthAmerica 0.000582 ***
## log(price):continentTopFiveOtherEurope  0.039468 *  
## log(price):continentTopFiveSouthAmerica 0.009825 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.0008341881)
## 
##     Null deviance: 138.196  on 102921  degrees of freedom
## Residual deviance:  85.845  on 102908  degrees of freedom
## AIC: -437523
## 
## Number of Fisher Scoring iterations: 2

##        fit      lwr      upr
## 1 90.00091 85.00703 95.28817
## [1] 0.03654621
## [1] 0.0008342488
## [1] 1.029305

LDA

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 2
## 
## [1] This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background. Balanced and rewarding from start to finish, it has years ahead of it to develop further nuance. Enjoy 2022–2030.
## [2] Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vanilla. This is full, layered, intense and cushioned on the palate, with rich flavors of chocolaty black fruits and baking spices. A toasty, everlasting finish is heady but ideally balanced. Drink through 2023.
## # A tibble: 65,302 x 3
##    topic       term         beta
##    <int>      <chr>        <dbl>
##  1     1        100 4.663715e-04
##  2     2        100 7.081641e-04
##  3     1   20222030 6.438382e-07
##  4     2   20222030 1.116309e-06
##  5     1        age 4.351469e-03
##  6     2        age 2.740947e-03
##  7     1      ahead 3.675568e-05
##  8     2      ahead 7.825068e-05
##  9     1 background 2.853492e-04
## 10     2 background 1.755885e-04
## # ... with 65,292 more rows

## # A tibble: 267 x 4
##        term      topic1       topic2   log_ratio
##       <chr>       <dbl>        <dbl>       <dbl>
##  1   accent 0.000597273 2.168703e-03  1.86037010
##  2     acid 0.019160079 1.820266e-03 -3.39588224
##  3      add 0.001084956 1.153448e-03  0.08831623
##  4      age 0.004351469 2.740947e-03 -0.66682783
##  5  alcohol 0.001293881 1.779603e-03  0.45985068
##  6   almond 0.001585721 2.607108e-04 -2.60461729
##  7   almost 0.001079766 1.911925e-03  0.82430696
##  8    along 0.001881767 5.952606e-04 -1.66049514
##  9     also 0.003160329 9.200576e-04 -1.78027848
## 10 although 0.001887905 7.860205e-06 -7.90800368
## # ... with 257 more rows

Sentiment Analysis

wine_train_lda = data.frame(wine_train, topic1 = corpus_documents$gamma[1:nrow(wine_train)], document = rownames(wine_train))
wine_train_sentiment = merge(wine_train_lda, corpus_sentiments[,c(1,4)], by = "document")

km.out = kmeans(wine_train_sentiment$sentiment, 3)
plot(wine_train_sentiment$sentiment, col=(km.out$cluster + 1))

wine_train_clustered = data.frame(wine_train_sentiment, cluster = km.out$cluster)

Adding topic1 to the model

## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 3581.2, df = 28, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(resid(mod2), 5000)
## W = 0.99217, p-value = 5.541e-16
## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 3581.2, df = 28, p-value < 2.2e-16
## 
## Call:
## glm(formula = log(points) ~ log(price) * continentTopFive * cluster + 
##     topic1, data = wine_train_clustered)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.130445  -0.018064   0.001548   0.019782   0.102926  
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                      4.391240   0.012356
## log(price)                                       0.027842   0.003906
## continentTopFiveAsia                             0.001249   0.023942
## continentTopFiveAustralia                       -0.015017   0.012836
## continentTopFiveEuropeTop5                      -0.010464   0.011670
## continentTopFiveNorthAmerica                    -0.034551   0.011730
## continentTopFiveOtherEurope                     -0.014660   0.013949
## continentTopFiveSouthAmerica                    -0.033462   0.012315
## cluster                                         -0.004750   0.004762
## topic1                                          -0.010365   0.009020
## log(price):continentTopFiveAsia                 -0.003709   0.007512
## log(price):continentTopFiveAustralia             0.005764   0.004292
## log(price):continentTopFiveEuropeTop5            0.003928   0.003947
## log(price):continentTopFiveNorthAmerica          0.009176   0.003961
## log(price):continentTopFiveOtherEurope           0.006040   0.004604
## log(price):continentTopFiveSouthAmerica          0.008932   0.004193
## log(price):cluster                               0.001867   0.001614
## continentTopFiveAsia:cluster                     0.006376   0.009796
## continentTopFiveAustralia:cluster                0.004392   0.005316
## continentTopFiveEuropeTop5:cluster               0.004285   0.004827
## continentTopFiveNorthAmerica:cluster             0.005156   0.004850
## continentTopFiveOtherEurope:cluster              0.003699   0.005812
## continentTopFiveSouthAmerica:cluster             0.006518   0.005096
## log(price):continentTopFiveAsia:cluster         -0.002328   0.003061
## log(price):continentTopFiveAustralia:cluster    -0.001958   0.001775
## log(price):continentTopFiveEuropeTop5:cluster   -0.001809   0.001631
## log(price):continentTopFiveNorthAmerica:cluster -0.002107   0.001636
## log(price):continentTopFiveOtherEurope:cluster  -0.001728   0.001918
## log(price):continentTopFiveSouthAmerica:cluster -0.002419   0.001733
##                                                 t value Pr(>|t|)    
## (Intercept)                                     355.397  < 2e-16 ***
## log(price)                                        7.128 1.03e-12 ***
## continentTopFiveAsia                              0.052  0.95838    
## continentTopFiveAustralia                        -1.170  0.24202    
## continentTopFiveEuropeTop5                       -0.897  0.36992    
## continentTopFiveNorthAmerica                     -2.946  0.00322 ** 
## continentTopFiveOtherEurope                      -1.051  0.29327    
## continentTopFiveSouthAmerica                     -2.717  0.00659 ** 
## cluster                                          -0.998  0.31852    
## topic1                                           -1.149  0.25051    
## log(price):continentTopFiveAsia                  -0.494  0.62146    
## log(price):continentTopFiveAustralia              1.343  0.17930    
## log(price):continentTopFiveEuropeTop5             0.995  0.31959    
## log(price):continentTopFiveNorthAmerica           2.316  0.02053 *  
## log(price):continentTopFiveOtherEurope            1.312  0.18963    
## log(price):continentTopFiveSouthAmerica           2.130  0.03314 *  
## log(price):cluster                                1.157  0.24721    
## continentTopFiveAsia:cluster                      0.651  0.51514    
## continentTopFiveAustralia:cluster                 0.826  0.40868    
## continentTopFiveEuropeTop5:cluster                0.888  0.37462    
## continentTopFiveNorthAmerica:cluster              1.063  0.28779    
## continentTopFiveOtherEurope:cluster               0.636  0.52452    
## continentTopFiveSouthAmerica:cluster              1.279  0.20087    
## log(price):continentTopFiveAsia:cluster          -0.761  0.44689    
## log(price):continentTopFiveAustralia:cluster     -1.103  0.27015    
## log(price):continentTopFiveEuropeTop5:cluster    -1.109  0.26724    
## log(price):continentTopFiveNorthAmerica:cluster  -1.288  0.19778    
## log(price):continentTopFiveOtherEurope:cluster   -0.901  0.36783    
## log(price):continentTopFiveSouthAmerica:cluster  -1.396  0.16281    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.0008383856)
## 
##     Null deviance: 120.669  on 89226  degrees of freedom
## Residual deviance:  74.782  on 89198  degrees of freedom
## AIC: -378841
## 
## Number of Fisher Scoring iterations: 2
## [1] 0.004087244
## [1] 0.0008385478
## [1] 1.029381